Read the data, visualize it, and get familiar with the variables.
datapath<-"~/Documents/UChicago/Courses/Statistical Analysis/Final Project"
AssignmentData<-
read.csv(file=paste(datapath,"regressionassignmentdata2014.csv",sep="/"),
row.names=1,header=TRUE,sep=",")
head(AssignmentData)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR Output1
## 1/5/1981 13.52 13.09 12.289 12.28 12.294 12.152 11.672 18.01553
## 1/6/1981 13.58 13.16 12.429 12.31 12.214 12.112 11.672 18.09140
## 1/7/1981 14.50 13.90 12.929 12.78 12.614 12.382 11.892 19.44731
## 1/8/1981 14.76 14.00 13.099 12.95 12.684 12.352 11.912 19.74851
## 1/9/1981 15.20 14.30 13.539 13.28 12.884 12.572 12.132 20.57204
## 1/12/1981 15.22 14.23 13.179 12.94 12.714 12.452 12.082 20.14218
## Easing Tightening
## 1/5/1981 NA NA
## 1/6/1981 NA NA
## 1/7/1981 NA NA
## 1/8/1981 NA NA
## 1/9/1981 NA NA
## 1/12/1981 NA NA
The first 7 variables (input variables) are the daily records of the US Treasury yields to maturity. Plot the input variables as below.
matplot(AssignmentData[,-c(8,9,10)],type='l')
nn <- ncol(AssignmentData)-3 # legend should cover only the 7 plotted yield columns
legend("topright", colnames(AssignmentData)[1:nn],col=seq_len(nn),cex=0.8,fill=seq_len(nn))
Plot the input variables together with the output variable. The meaning of the variable Output1 will become clear later.
matplot(AssignmentData[,-c(9,10)],type='l')
nn <- ncol(AssignmentData)-2 # the 7 yields plus Output1
legend("topright", colnames(AssignmentData)[1:nn],col=seq_len(nn),cex=0.8,fill=seq_len(nn))
Estimate a simple regression model of the output variable on each of the input variables given in AssignmentData.
Input1.linear.Model = lm(AssignmentData$Output1~AssignmentData$USGG3M)
Input2.linear.Model = lm(AssignmentData$Output1~AssignmentData$USGG6M)
Input3.linear.Model = lm(AssignmentData$Output1~AssignmentData$USGG2YR)
Input4.linear.Model = lm(AssignmentData$Output1~AssignmentData$USGG3YR)
Input5.linear.Model = lm(AssignmentData$Output1~AssignmentData$USGG5YR)
Input6.linear.Model = lm(AssignmentData$Output1~AssignmentData$USGG10YR)
Input7.linear.Model = lm(AssignmentData$Output1~AssignmentData$USGG30YR)
Check the names of the fields returned by the lm() and summary() functions, if necessary. Analyze the summary.
Check the significance of the estimated parameters and of the model as a whole, and the amount of variance explained. Store the coefficients for each input variable.
The following code gives an example of the analysis for the first input variable.
summary(Input1.linear.Model)
##
## Call:
## lm(formula = AssignmentData$Output1 ~ AssignmentData$USGG3M)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9374 -1.2115 -0.0528 1.2640 7.7048
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.72318 0.03137 -373.7 <2e-16 ***
## AssignmentData$USGG3M 2.50756 0.00541 463.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.69 on 8298 degrees of freedom
## Multiple R-squared: 0.9628, Adjusted R-squared: 0.9628
## F-statistic: 2.148e+05 on 1 and 8298 DF, p-value: < 2.2e-16
c(Total.Variance=var(AssignmentData[,8]),Unexplained.Variance=summary(Input1.linear.Model)$sigma^2)
## Total.Variance Unexplained.Variance
## 76.804438 2.857058
Coefficients.Input1 = Input1.linear.Model$coefficients
Coefficients.Input1
## (Intercept) AssignmentData$USGG3M
## -11.723184 2.507561
matplot(AssignmentData[,8],type="l",xaxt="n",col = "red")
lines(Input1.linear.Model$fitted.values,col="black")
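Before repeating this for the remaining inputs, note how Unexplained.Variance relates to the reported \(R^2\): the identity adj.\(R^2\) = 1 - sigma^2/var(y) holds exactly, because both ratios use the same degree-of-freedom corrections. A toy check on synthetic data (not the assignment data):

```r
# Toy check: summary(fit)$sigma^2 is the unexplained variance, and
# 1 - sigma^2 / var(y) reproduces the adjusted R^2 exactly
set.seed(8)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
c(one.minus.ratio = 1 - summary(fit)$sigma^2 / var(y),
  adj.r.squared   = summary(fit)$adj.r.squared)
```

For the first input above, 1 - 2.857058/76.804438 is approximately 0.9628, matching the adjusted \(R^2\) reported for USGG3M.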
Second input, USGG6M.
summary(Input2.linear.Model)
##
## Call:
## lm(formula = AssignmentData$Output1 ~ AssignmentData$USGG6M)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7529 -1.0385 0.0224 1.1443 4.1509
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.097528 0.026469 -457.0 <2e-16 ***
## AssignmentData$USGG6M 2.497235 0.004445 561.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.403 on 8298 degrees of freedom
## Multiple R-squared: 0.9744, Adjusted R-squared: 0.9744
## F-statistic: 3.157e+05 on 1 and 8298 DF, p-value: < 2.2e-16
c(Total.Variance=var(AssignmentData[,8]),Unexplained.Variance=summary(Input2.linear.Model)$sigma^2)
## Total.Variance Unexplained.Variance
## 76.804438 1.967321
Coefficients.Input2 = Input2.linear.Model$coefficients
Coefficients.Input2
## (Intercept) AssignmentData$USGG6M
## -12.097528 2.497235
matplot(AssignmentData[,8],type="l",xaxt="n",col = "red")
lines(Input2.linear.Model$fitted.values,col="grey")
Third input, USGG2YR.
summary(Input3.linear.Model)
##
## Call:
## lm(formula = AssignmentData$Output1 ~ AssignmentData$USGG2YR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.43277 -0.38356 -0.00578 0.43362 1.72564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.055775 0.010031 -1302 <2e-16 ***
## AssignmentData$USGG2YR 2.400449 0.001532 1567 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5087 on 8298 degrees of freedom
## Multiple R-squared: 0.9966, Adjusted R-squared: 0.9966
## F-statistic: 2.455e+06 on 1 and 8298 DF, p-value: < 2.2e-16
c(Total.Variance=var(AssignmentData[,8]),Unexplained.Variance=summary(Input3.linear.Model)$sigma^2)
## Total.Variance Unexplained.Variance
## 76.8044379 0.2588092
Coefficients.Input3 = Input3.linear.Model$coefficients
Coefficients.Input3
## (Intercept) AssignmentData$USGG2YR
## -13.055775 2.400449
matplot(AssignmentData[,8],type="l",xaxt="n",col = "red")
lines(Input3.linear.Model$fitted.values,col="green")
Fourth input, USGG3YR.
summary(Input4.linear.Model)
##
## Call:
## lm(formula = AssignmentData$Output1 ~ AssignmentData$USGG3YR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0160 -0.2459 0.0325 0.2638 3.0666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.861618 0.008214 -1688 <2e-16 ***
## AssignmentData$USGG3YR 2.455793 0.001230 1996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3996 on 8298 degrees of freedom
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9979
## F-statistic: 3.984e+06 on 1 and 8298 DF, p-value: < 2.2e-16
c(Total.Variance=var(AssignmentData[,8]),Unexplained.Variance=summary(Input4.linear.Model)$sigma^2)
## Total.Variance Unexplained.Variance
## 76.804438 0.159657
Coefficients.Input4 = Input4.linear.Model$coefficients
Coefficients.Input4
## (Intercept) AssignmentData$USGG3YR
## -13.861618 2.455793
matplot(AssignmentData[,8],type="l",xaxt="n",col = "red")
lines(Input4.linear.Model$fitted.values,col="blue")
Fifth input, USGG5YR.
summary(Input5.linear.Model)
##
## Call:
## lm(formula = AssignmentData$Output1 ~ AssignmentData$USGG5YR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6517 -0.6216 -0.0147 0.6523 3.1720
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.436649 0.017819 -866.3 <2e-16 ***
## AssignmentData$USGG5YR 2.568742 0.002581 995.2 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7989 on 8298 degrees of freedom
## Multiple R-squared: 0.9917, Adjusted R-squared: 0.9917
## F-statistic: 9.904e+05 on 1 and 8298 DF, p-value: < 2.2e-16
c(Total.Variance=var(AssignmentData[,8]),Unexplained.Variance=summary(Input5.linear.Model)$sigma^2)
## Total.Variance Unexplained.Variance
## 76.8044379 0.6382572
Coefficients.Input5 = Input5.linear.Model$coefficients
Coefficients.Input5
## (Intercept) AssignmentData$USGG5YR
## -15.436649 2.568742
matplot(AssignmentData[,8],type="l",xaxt="n",col = "red")
lines(Input5.linear.Model$fitted.values,col="cyan")
Sixth input, USGG10YR.
summary(Input6.linear.Model)
##
## Call:
## lm(formula = AssignmentData$Output1 ~ AssignmentData$USGG10YR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0334 -1.2810 -0.1944 1.3561 4.2254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.063370 0.039182 -461.0 <2e-16 ***
## AssignmentData$USGG10YR 2.786991 0.005455 510.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.538 on 8298 degrees of freedom
## Multiple R-squared: 0.9692, Adjusted R-squared: 0.9692
## F-statistic: 2.61e+05 on 1 and 8298 DF, p-value: < 2.2e-16
c(Total.Variance=var(AssignmentData[,8]),Unexplained.Variance=summary(Input6.linear.Model)$sigma^2)
## Total.Variance Unexplained.Variance
## 76.804438 2.366752
Coefficients.Input6 = Input6.linear.Model$coefficients
Coefficients.Input6
## (Intercept) AssignmentData$USGG10YR
## -18.063370 2.786991
matplot(AssignmentData[,8],type="l",xaxt="n",col = "red")
lines(Input6.linear.Model$fitted.values,col="Magenta")
Seventh input, USGG30YR.
summary(Input7.linear.Model)
##
## Call:
## lm(formula = AssignmentData$Output1 ~ AssignmentData$USGG30YR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6447 -1.8426 -0.4731 1.9529 4.8734
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -21.08591 0.06560 -321.4 <2e-16 ***
## AssignmentData$USGG30YR 3.06956 0.00886 346.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.229 on 8298 degrees of freedom
## Multiple R-squared: 0.9353, Adjusted R-squared: 0.9353
## F-statistic: 1.2e+05 on 1 and 8298 DF, p-value: < 2.2e-16
c(Total.Variance=var(AssignmentData[,8]),Unexplained.Variance=summary(Input7.linear.Model)$sigma^2)
## Total.Variance Unexplained.Variance
## 76.804438 4.967286
Coefficients.Input7 = Input7.linear.Model$coefficients
Coefficients.Input7
## (Intercept) AssignmentData$USGG30YR
## -21.085905 3.069561
matplot(AssignmentData[,8],type="l",xaxt="n",col = "red")
lines(Input7.linear.Model$fitted.values,col="yellow")
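The seven nearly identical chunks above can also be collapsed into a single loop. A minimal sketch, using a small synthetic data frame as a stand-in for AssignmentData (the column names In1..In3 and the data are made up for illustration; with the real data the loop would run over columns 1:7):

```r
# Synthetic stand-in: three inputs and one output
set.seed(9)
dat <- data.frame(In1 = rnorm(50), In2 = rnorm(50), In3 = rnorm(50))
dat$Output1 <- rowSums(dat) + rnorm(50, sd = 0.1)

# One row of summary statistics per input variable
fit.stats <- t(sapply(names(dat)[1:3], function(nm) {
  fit <- lm(dat$Output1 ~ dat[[nm]])
  c(Intercept = unname(coef(fit)[1]),
    Slope     = unname(coef(fit)[2]),
    R.squared = summary(fit)$r.squared,
    Unexplained.Variance = summary(fit)$sigma^2)
}))
fit.stats
```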
Collect all slopes and intercepts in one table and print this table. Try to do it in one line using the apply() function.
lms.coefficients<- apply(AssignmentData[,1:7],2,function(z) lm(AssignmentData$Output1 ~ z)$coefficients)
lms.coefficients
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR
## (Intercept) -11.723184 -12.097528 -13.055775 -13.861618 -15.436649
## z 2.507561 2.497235 2.400449 2.455793 2.568742
## USGG10YR USGG30YR
## (Intercept) -18.063370 -21.085905
## z 2.786991 3.069561
#alternative
#lms.coefficients<- sapply(1:7, function(z) lm(AssignmentData$Output1 ~ AssignmentData[,z])$coefficients)
Fit reversed linear regression models: use the single output (column 8, Output1) as the predictor and each of the original inputs as the response.
RInput1.linear.Model = lm(AssignmentData$USGG3M ~ AssignmentData$Output1)
RInput2.linear.Model = lm(AssignmentData$USGG6M ~ AssignmentData$Output1)
RInput3.linear.Model = lm(AssignmentData$USGG2YR ~ AssignmentData$Output1)
RInput4.linear.Model = lm(AssignmentData$USGG3YR ~ AssignmentData$Output1)
RInput5.linear.Model = lm(AssignmentData$USGG5YR ~ AssignmentData$Output1)
RInput6.linear.Model = lm(AssignmentData$USGG10YR ~ AssignmentData$Output1)
RInput7.linear.Model = lm(AssignmentData$USGG30YR ~ AssignmentData$Output1)
Collect all slopes and intercepts in one table and print this table.
reverse.lms.coefficients<- sapply(1:7, function(z) lm(AssignmentData[,z] ~ AssignmentData$Output1)$coefficients)
reverse.lms.coefficients
## [,1] [,2] [,3] [,4] [,5]
## (Intercept) 4.6751341 4.844370 5.4388879 5.6444580 6.009421
## AssignmentData$Output1 0.3839609 0.390187 0.4151851 0.4063541 0.386061
## [,6] [,7]
## (Intercept) 6.4813160 6.8693554
## AssignmentData$Output1 0.3477544 0.3047124
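A quick consistency check on these reversed regressions: for any pair of variables, the forward slope times the reverse slope equals the squared correlation. For example, 2.507561 * 0.3839609 is approximately 0.9628, the \(R^2\) reported for USGG3M above. A toy verification on synthetic data:

```r
# For a simple regression pair, slope(y ~ x) * slope(x ~ y) = cor(x, y)^2
set.seed(7)
x <- rnorm(100)
y <- 1 + 2.5 * x + rnorm(100)
b.fwd <- coef(lm(y ~ x))[2]   # slope of y ~ x
b.rev <- coef(lm(x ~ y))[2]   # slope of x ~ y
c(slope.product = unname(b.fwd * b.rev),
  r.squared     = cor(x, y)^2)
```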
Estimate logistic regression using all inputs and the data on FED tightening and easing cycles.
AssignmentDataLogistic<-data.matrix(AssignmentData,rownames.force="automatic")
Prepare the easing-tightening data. Make the easing column equal to 0 during the easing periods and NA otherwise. Make the tightening column equal to 1 during the tightening periods and NA otherwise.
# Create columns of easing periods (as 0s) and tightening periods (as 1s)
EasingPeriods<-AssignmentDataLogistic[,9]
EasingPeriods[AssignmentDataLogistic[,9]==1]<-0
TighteningPeriods<-AssignmentDataLogistic[,10]
# Check easing and tightening periods
cbind(EasingPeriods,TighteningPeriods)[c(550:560,900:910,970:980),]
## EasingPeriods TighteningPeriods
## 3/29/1983 NA NA
## 3/30/1983 NA NA
## 3/31/1983 NA NA
## 4/4/1983 NA 1
## 4/5/1983 NA 1
## 4/6/1983 NA 1
## 4/7/1983 NA 1
## 4/8/1983 NA 1
## 4/11/1983 NA 1
## 4/12/1983 NA 1
## 4/13/1983 NA 1
## 8/27/1984 NA 1
## 8/28/1984 NA 1
## 8/29/1984 NA 1
## 8/30/1984 NA 1
## 8/31/1984 NA 1
## 9/4/1984 NA NA
## 9/5/1984 NA NA
## 9/6/1984 NA NA
## 9/7/1984 NA NA
## 9/10/1984 NA NA
## 9/11/1984 NA NA
## 12/10/1984 0 NA
## 12/11/1984 0 NA
## 12/12/1984 0 NA
## 12/13/1984 0 NA
## 12/14/1984 0 NA
## 12/17/1984 0 NA
## 12/18/1984 0 NA
## 12/19/1984 0 NA
## 12/20/1984 0 NA
## 12/21/1984 0 NA
## 12/24/1984 0 NA
Remove the periods of neither easing nor tightening.
All.NAs<-is.na(EasingPeriods)&is.na(TighteningPeriods)
AssignmentDataLogistic.EasingTighteningOnly<-AssignmentDataLogistic
AssignmentDataLogistic.EasingTighteningOnly[,9]<-EasingPeriods
AssignmentDataLogistic.EasingTighteningOnly<-AssignmentDataLogistic.EasingTighteningOnly[!All.NAs,]
AssignmentDataLogistic.EasingTighteningOnly[is.na(AssignmentDataLogistic.EasingTighteningOnly[,10]),10]<-0 # easing rows still have NA in the tightening column; recode them to 0
matplot(AssignmentDataLogistic.EasingTighteningOnly[,-c(9,10)],type="l",ylab="Data and Binary Fed Mode")
lines(AssignmentDataLogistic.EasingTighteningOnly[,10]*20,col="red") # scaled by 20 so the binary indicator is visible on the yield axis
# only the tightening indicator (0/1) is plotted
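The recoding steps above can be sanity-checked on short toy vectors: easing days end up as 0, tightening days as 1, and days in neither cycle are dropped.

```r
# Toy check of the easing/tightening recoding logic
easing     <- c(1, 1, NA, NA, NA)
tightening <- c(NA, NA, 1, NA, 1)
easing[easing == 1] <- 0                        # easing flag -> 0
keep  <- !(is.na(easing) & is.na(tightening))   # drop days in neither cycle
label <- ifelse(is.na(tightening[keep]), 0, 1)  # remaining NA tightening -> 0
label  # 0 0 1 1
```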
Estimate logistic regression with 3M yields as predictors for easing/tightening output.
LogisticModel.TighteningEasing_3M<-glm(AssignmentDataLogistic.EasingTighteningOnly[,10]~ AssignmentDataLogistic.EasingTighteningOnly[,1],family=binomial(link=logit))
summary(LogisticModel.TighteningEasing_3M)
##
## Call:
## glm(formula = AssignmentDataLogistic.EasingTighteningOnly[, 10] ~
## AssignmentDataLogistic.EasingTighteningOnly[, 1], family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4239 -0.9014 -0.7737 1.3548 1.6743
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -2.15256 0.17328
## AssignmentDataLogistic.EasingTighteningOnly[, 1] 0.18638 0.02144
## z value Pr(>|z|)
## (Intercept) -12.422 <2e-16 ***
## AssignmentDataLogistic.EasingTighteningOnly[, 1] 8.694 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2983.5 on 2357 degrees of freedom
## Residual deviance: 2904.8 on 2356 degrees of freedom
## AIC: 2908.8
##
## Number of Fisher Scoring iterations: 4
matplot(AssignmentDataLogistic.EasingTighteningOnly[,-c(9,10)],type="l",ylab="Data and Fitted Values")
lines(AssignmentDataLogistic.EasingTighteningOnly[,10]*20,col="red")
lines(LogisticModel.TighteningEasing_3M$fitted.values*20,col="green")
Now use all inputs as predictors for logistic regression.
LogisticModel.TighteningEasing_All<-glm(AssignmentDataLogistic.EasingTighteningOnly[,10]~ AssignmentDataLogistic.EasingTighteningOnly[,-c(8,9,10)],family=binomial(link=logit))
summary(LogisticModel.TighteningEasing_All)
##
## Call:
## glm(formula = AssignmentDataLogistic.EasingTighteningOnly[, 10] ~
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)],
## family = binomial(link = logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2113 -0.8595 -0.5935 1.1306 2.5530
##
## Coefficients:
## Estimate
## (Intercept) -4.7552
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3M -3.3456
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG6M 4.1559
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG2YR 3.9460
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3YR -3.4642
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG5YR -3.2115
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG10YR -0.9705
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG30YR 3.3254
## Std. Error
## (Intercept) 0.4312
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3M 0.2666
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG6M 0.3748
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG2YR 0.7554
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3YR 0.9340
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG5YR 0.7795
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG10YR 0.9764
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG30YR 0.6138
## z value
## (Intercept) -11.029
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3M -12.548
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG6M 11.089
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG2YR 5.224
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3YR -3.709
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG5YR -4.120
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG10YR -0.994
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG30YR 5.418
## Pr(>|z|)
## (Intercept) < 2e-16
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3M < 2e-16
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG6M < 2e-16
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG2YR 1.75e-07
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3YR 0.000208
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG5YR 3.79e-05
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG10YR 0.320214
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG30YR 6.04e-08
##
## (Intercept) ***
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3M ***
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG6M ***
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG2YR ***
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3YR ***
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG5YR ***
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG10YR
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG30YR ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2983.5 on 2357 degrees of freedom
## Residual deviance: 2629.6 on 2350 degrees of freedom
## AIC: 2645.6
##
## Number of Fisher Scoring iterations: 4
Explore the estimated model.
summary(LogisticModel.TighteningEasing_All)$aic
## [1] 2645.579
summary(LogisticModel.TighteningEasing_All)$coefficients[,c(1,4)]
## Estimate
## (Intercept) -4.7551928
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3M -3.3456116
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG6M 4.1558535
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG2YR 3.9460296
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3YR -3.4642455
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG5YR -3.2115319
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG10YR -0.9705444
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG30YR 3.3253517
## Pr(>|z|)
## (Intercept) 2.784283e-28
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3M 4.073045e-36
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG6M 1.422964e-28
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG2YR 1.751687e-07
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG3YR 2.080617e-04
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG5YR 3.786229e-05
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG10YR 3.202140e-01
## AssignmentDataLogistic.EasingTighteningOnly[, -c(8, 9, 10)]USGG30YR 6.036041e-08
matplot(AssignmentDataLogistic.EasingTighteningOnly[,-c(9,10)],type="l",ylab="Results of Logistic Regression")
lines(AssignmentDataLogistic.EasingTighteningOnly[,10]*20,col="red")
lines(LogisticModel.TighteningEasing_All$fitted.values*20,col="green")
Interpret the coefficients of the model and the fitted values.
If k is the number of parameters in the model and L is the maximum value of the likelihood function, then AIC is defined as 2k - 2log(L). Since a smaller AIC is preferred, using all inputs as predictors looks better than using the 3M input alone.
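This definition can be checked against R's AIC() on a toy logistic fit (synthetic data, not the assignment data):

```r
# AIC = 2k - 2*log(L), verified against AIC() on a small glm
set.seed(6)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)
k <- length(coef(fit))                       # number of parameters
manual.aic <- 2 * k - 2 * as.numeric(logLik(fit))
all.equal(manual.aic, AIC(fit))              # TRUE
```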
Also, except for USGG10YR, all predictors appear strongly related to tightening. For example, the USGG3M coefficient of -3.3456116 means that, holding everything else equal, each unit increase in USGG3M decreases the log-odds log(P/(1-P)) by 3.3456, where P is the probability of tightening.
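To turn such a log-odds change into an odds multiplier, exponentiate the coefficient. A one-line illustration with the USGG3M estimate quoted above:

```r
# exp(beta) converts a log-odds change into a multiplicative change in odds
beta <- -3.3456116
exp(beta)  # each unit increase in USGG3M multiplies the odds of tightening by about 0.035
```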
Calculate and plot log-odds and probabilities. Compare probabilities with fitted values.
# Calculate odds
Log.Odds<-predict(LogisticModel.TighteningEasing_All)
plot(Log.Odds,type="l")
Probabilities<-1/(exp(-Log.Odds)+1) #from ln(P/(1-P)) to P
plot(LogisticModel.TighteningEasing_All$fitted.values,type="l",ylab="Fitted Values & Log-Odds")
lines(Probabilities,col="red")
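The same transformation can be verified on a small synthetic fit: glm's fitted values are exactly the inverse logit of the link-scale predictions.

```r
# Toy check: fitted.values == inverse logit of predict() on the link scale
set.seed(1)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(-0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial(link = logit))
log.odds <- predict(fit)             # link scale: log(P/(1-P))
probs <- 1 / (exp(-log.odds) + 1)    # back to probabilities
all.equal(unname(probs), unname(fit$fitted.values))  # TRUE
```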
Compare linear regression models with different combinations of predictors. Select the best combination.
Below we show only two of the possible combinations: the full model containing all 7 predictors and the null model containing only an intercept and none of the 7 predictors. Estimate other possible combinations.
# The data.matrix() conversion is unnecessary here: lm() with a formula takes a data frame
AssignmentDataRegressionComparison<-AssignmentData[,-c(9,10)]
Estimate the full model by using all 7 predictors.
RegressionModelComparison.Full <- lm(Output1 ~ ., data = AssignmentDataRegressionComparison)
names(summary(RegressionModelComparison.Full))
## [1] "call" "terms" "residuals" "coefficients"
## [5] "aliased" "sigma" "df" "r.squared"
## [9] "adj.r.squared" "fstatistic" "cov.unscaled"
#summary(RegressionModelComparison.Full)
Look at coefficients, \(R^2\), adjusted \(R^2\), degrees of freedom.
summary(RegressionModelComparison.Full)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.9041591 1.056850e-10 -141024294891 0
## USGG3M 0.3839609 9.860401e-11 3893968285 0
## USGG6M 0.3901870 1.500111e-10 2601053702 0
## USGG2YR 0.4151851 2.569451e-10 1615851177 0
## USGG3YR 0.4063541 3.299038e-10 1231735395 0
## USGG5YR 0.3860610 2.618339e-10 1474449865 0
## USGG10YR 0.3477544 2.800269e-10 1241860763 0
## USGG30YR 0.3047124 1.566487e-10 1945195584 0
summary(RegressionModelComparison.Full)$r.squared
## [1] 1
summary(RegressionModelComparison.Full)$adj.r.squared
## [1] 1
summary(RegressionModelComparison.Full)$df
## [1] 8 8292 8
Interpret the fitted model. How good is the fit? How significant are the parameters?
The fit is suspiciously perfect: \(R^2\) = 1, the standard errors are at numerical noise level, and all parameters are extremely significant. This suggests that Output1 is an exact linear combination of the seven yields rather than a noisy response.
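A toy check (synthetic data) of why a response that is an exact linear combination of the predictors produces this behavior:

```r
# When y is an exact linear combination of the columns of X,
# lm() reproduces it: residuals collapse to numerical noise and R^2 = 1
set.seed(2)
X <- matrix(rnorm(300), ncol = 3)
y <- drop(-1 + X %*% c(0.4, 0.3, 0.3))  # exact linear combination
fit <- lm(y ~ X)
c(max.abs.residual = max(abs(resid(fit))),
  r.squared        = summary(fit)$r.squared)
```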
Estimate the Null model by including only intercept.
RegressionModelComparison.Null <- lm(Output1 ~ 1, data = AssignmentDataRegressionComparison)
summary(RegressionModelComparison.Null)
##
## Call:
## lm(formula = Output1 ~ 1, data = AssignmentDataRegressionComparison)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.173 -6.509 -0.415 4.860 27.298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.42e-11 9.62e-02 0 1
##
## Residual standard error: 8.764 on 8299 degrees of freedom
summary(RegressionModelComparison.Null)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.420082e-11 0.09619536 1.476248e-10 1
summary(RegressionModelComparison.Null)$r.squared
## [1] 0
summary(RegressionModelComparison.Null)$adj.r.squared
## [1] 0
summary(RegressionModelComparison.Null)$df
## [1] 1 8299 1
Why does summary(RegressionModelComparison.Null) not show \(R^2\)?
Because the model contains only the intercept and no slopes. As shown below, SSM is essentially zero, so SST = SSE and \(R^2\) = SSM/SST = 0: the model explains nothing.
(SSM = sum((RegressionModelComparison.Null$fitted.values - mean(RegressionModelComparison.Null$fitted.values))^2))
## [1] 6.052545e-24
(SSE = sum((AssignmentDataRegressionComparison$Output1 - RegressionModelComparison.Null$fitted.values)^2))
## [1] 637400
(SST = SSE+SSM)
## [1] 637400
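The decomposition used above can also be verified on a toy fit with a nonzero slope, where \(R^2\) = SSM/SST is no longer zero:

```r
# Toy check: SST = SSM + SSE and R^2 = SSM/SST for a fit with intercept
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
fit <- lm(y ~ x)
SSE <- sum(resid(fit)^2)
SSM <- sum((fitted(fit) - mean(y))^2)
SST <- sum((y - mean(y))^2)
c(SSM.plus.SSE     = SSM + SSE, SST = SST,
  R2.by.definition = SSM / SST,
  R2.from.summary  = summary(fit)$r.squared)
```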
Compare models pairwise using anova()
anova(RegressionModelComparison.Full,RegressionModelComparison.Null)
## Analysis of Variance Table
##
## Model 1: Output1 ~ USGG3M + USGG6M + USGG2YR + USGG3YR + USGG5YR + USGG10YR +
## USGG30YR
## Model 2: Output1 ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8292 0
## 2 8299 637400 -7 -637400 3.73e+22 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpret the results of anova().
Certainly not all betas are zero. However, the full model explains all of the variation, which looks like overfitting and might not be the best choice.
Repeat the analysis for different combinations of input variables and select the one you think is the best. Explain your selection.
drop1(RegressionModelComparison.Full)
## Warning: attempting model selection on an essentially perfect fit is
## nonsense
## Single term deletions
##
## Model:
## Output1 ~ USGG3M + USGG6M + USGG2YR + USGG3YR + USGG5YR + USGG10YR +
## USGG30YR
## Df Sum of Sq RSS AIC
## <none> 0.000 -336591
## USGG3M 1 37.016 37.016 -44911
## USGG6M 1 16.516 16.516 -51609
## USGG2YR 1 6.374 6.374 -59512
## USGG3YR 1 3.704 3.704 -64018
## USGG5YR 1 5.307 5.307 -61032
## USGG10YR 1 3.765 3.765 -63882
## USGG30YR 1 9.237 9.237 -56433
pairs(AssignmentDataRegressionComparison)
drop1() and pairs() do not seem like good model selection tools in this case: with an essentially perfect fit, single-term deletions give meaningless comparisons (note the warning above).
(myScope<-names(AssignmentDataRegressionComparison)[-8])
## [1] "USGG3M" "USGG6M" "USGG2YR" "USGG3YR" "USGG5YR" "USGG10YR"
## [7] "USGG30YR"
add1(RegressionModelComparison.Null,scope=myScope)
## Single term additions
##
## Model:
## Output1 ~ 1
## Df Sum of Sq RSS AIC
## <none> 637400 36033
## USGG3M 1 613692 23708 8715
## USGG6M 1 621075 16325 5618
## USGG2YR 1 635252 2148 -11217
## USGG3YR 1 636075 1325 -15226
## USGG5YR 1 632104 5296 -3725
## USGG10YR 1 617761 19639 7153
## USGG30YR 1 596181 41219 13306
USGG3YR gives the lowest AIC, so it should be added first.
RegressionModelComparison.1 = lm(Output1 ~ USGG3YR, data = AssignmentDataRegressionComparison)
summary(RegressionModelComparison.1)
##
## Call:
## lm(formula = Output1 ~ USGG3YR, data = AssignmentDataRegressionComparison)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0160 -0.2459 0.0325 0.2638 3.0666
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.861618 0.008214 -1688 <2e-16 ***
## USGG3YR 2.455793 0.001230 1996 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3996 on 8298 degrees of freedom
## Multiple R-squared: 0.9979, Adjusted R-squared: 0.9979
## F-statistic: 3.984e+06 on 1 and 8298 DF, p-value: < 2.2e-16
Note that the adjusted \(R^2\) is already extremely high.
(myScope<-names(AssignmentDataRegressionComparison)[-c(4,8)])
## [1] "USGG3M" "USGG6M" "USGG2YR" "USGG5YR" "USGG10YR" "USGG30YR"
add1(RegressionModelComparison.1,scope=myScope)
## Single term additions
##
## Model:
## Output1 ~ USGG3YR
## Df Sum of Sq RSS AIC
## <none> 1324.83 -15226
## USGG3M 1 407.98 916.85 -18279
## USGG6M 1 270.17 1054.66 -17117
## USGG2YR 1 82.93 1241.91 -15761
## USGG5YR 1 20.94 1303.89 -15356
## USGG10YR 1 64.05 1260.78 -15636
## USGG30YR 1 100.79 1224.04 -15881
RegressionModelComparison.2 = lm(Output1 ~ USGG3YR + USGG3M, data = AssignmentDataRegressionComparison)
summary(RegressionModelComparison.2)
##
## Call:
## lm(formula = Output1 ~ USGG3YR + USGG3M, data = AssignmentDataRegressionComparison)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19592 -0.25503 0.01678 0.24577 1.77420
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.671658 0.007515 -1819.36 <2e-16 ***
## USGG3YR 2.171934 0.004782 454.14 <2e-16 ***
## USGG3M 0.302081 0.004972 60.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3324 on 8297 degrees of freedom
## Multiple R-squared: 0.9986, Adjusted R-squared: 0.9986
## F-statistic: 2.88e+06 on 2 and 8297 DF, p-value: < 2.2e-16
anova(RegressionModelComparison.1,RegressionModelComparison.2)
## Analysis of Variance Table
##
## Model 1: Output1 ~ USGG3YR
## Model 2: Output1 ~ USGG3YR + USGG3M
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8298 1324.83
## 2 8297 916.85 1 407.98 3692 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA result shows that the two models are significantly different, so we keep both predictors.
(myScope<-names(AssignmentDataRegressionComparison)[-c(1,4,8)])
## [1] "USGG6M" "USGG2YR" "USGG5YR" "USGG10YR" "USGG30YR"
add1(RegressionModelComparison.2,scope=myScope)
## Single term additions
##
## Model:
## Output1 ~ USGG3YR + USGG3M
## Df Sum of Sq RSS AIC
## <none> 916.85 -18279
## USGG6M 1 139.91 776.95 -19652
## USGG2YR 1 176.99 739.86 -20058
## USGG5YR 1 736.49 180.36 -31773
## USGG10YR 1 864.29 52.56 -42007
## USGG30YR 1 863.62 53.23 -41902
We should add USGG10YR next.
RegressionModelComparison.3 = lm(Output1 ~ USGG3YR + USGG3M + USGG10YR, data = AssignmentDataRegressionComparison)
summary(RegressionModelComparison.3)
##
## Call:
## lm(formula = Output1 ~ USGG3YR + USGG3M + USGG10YR, data = AssignmentDataRegressionComparison)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42997 -0.04207 0.00512 0.04825 0.69394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -14.723583 0.003369 -4370.4 <2e-16 ***
## USGG3YR 1.120774 0.003068 365.3 <2e-16 ***
## USGG3M 0.705571 0.001616 436.7 <2e-16 ***
## USGG10YR 0.786689 0.002130 369.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0796 on 8296 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 3.353e+07 on 3 and 8296 DF, p-value: < 2.2e-16
anova(RegressionModelComparison.2,RegressionModelComparison.3)
## Analysis of Variance Table
##
## Model 1: Output1 ~ USGG3YR + USGG3M
## Model 2: Output1 ~ USGG3YR + USGG3M + USGG10YR
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8297 916.85
## 2 8296 52.56 1 864.29 136412 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA shows a significant improvement again, and the adjusted \(R^2\) is close to 1. For now we have picked one short-term, one medium-term, and one long-term yield as predictors.
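The manual add1() iterations can also be automated with step(). A sketch on synthetic data (the variable names x1..x3 are illustrative, not the yield columns):

```r
# Forward selection by AIC with step(), starting from the null model
set.seed(4)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))
d$y <- 1 + 2 * d$x1 + 0.5 * d$x2 + rnorm(100, sd = 0.2)
null.fit <- lm(y ~ 1, data = d)
full.fit <- lm(y ~ ., data = d)
forward.fit <- step(null.fit, scope = formula(full.fit),
                    direction = "forward", trace = 0)
formula(forward.fit)  # the informative predictors x1 and x2 are selected
```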
Relative importance measures
suppressMessages(library(relaimpo))
metrics.Full <- calc.relimp(RegressionModelComparison.Full, type = c("lmg", "first", "last","betasq", "pratt"))
## Warning in rev(variances[[p]]) - variances[[p + 1]]: Recycling array of length 1 in vector-array arithmetic is deprecated.
## Use c() or as.vector() instead.
metrics.Full
## Response variable: Output1
## Total response variance: 76.80444
## Analysis based on 8300 observations
##
## 7 Regressors:
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## Proportion of variance explained by model: 100%
## Metrics are not normalized (rela=FALSE).
##
## Relative importance metrics:
##
## lmg last first betasq pratt
## USGG3M 0.1404709 5.807356e-05 0.9628054 0.022574051 0.14742597
## USGG6M 0.1425004 2.591148e-05 0.9743884 0.023788052 0.15224586
## USGG2YR 0.1466582 9.999916e-06 0.9966307 0.029814849 0.17237863
## USGG3YR 0.1468640 5.810700e-06 0.9979215 0.027322621 0.16512368
## USGG5YR 0.1457366 8.326330e-06 0.9916908 0.022399959 0.14904306
## USGG10YR 0.1417725 5.906625e-06 0.9691884 0.015089760 0.12093312
## USGG30YR 0.1359975 1.449173e-05 0.9353333 0.009217098 0.09284966
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs 4Xs 5Xs 6Xs
## USGG3M 2.507561 0.2263472 0.6109835 0.5227870 0.4628923 0.4211955
## USGG6M 2.497235 1.4470531 0.7276930 0.6852762 0.5880926 0.4868271
## USGG2YR 2.400449 1.8713892 1.2496809 0.9362129 0.6876128 0.4720868
## USGG3YR 2.455793 2.1971757 1.2775152 0.6434783 0.4689353 0.4963413
## USGG5YR 2.568742 1.9951251 1.0962316 0.7209641 0.5450640 0.4051945
## USGG10YR 2.786991 1.4935881 0.5874971 0.6759481 0.5704141 0.4630193
## USGG30YR 3.069561 -0.4030409 0.4839594 0.3866284 0.3539104 0.3276156
## 7Xs
## USGG3M 0.3839609
## USGG6M 0.3901870
## USGG2YR 0.4151851
## USGG3YR 0.4063541
## USGG5YR 0.3860610
## USGG10YR 0.3477544
## USGG30YR 0.3047124
(metrics.Full.rank<-metrics.Full@lmg.rank)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## 6 4 2 1 3 5 7
If we were to use three predictors, the relative importance measures would give us a model with USGG3YR, USGG2YR and USGG5YR.
Selection of predictors based on regsubsets()
library(leaps)
subsetsFull<-regsubsets(x=AssignmentDataRegressionComparison[,1:7],y=AssignmentDataRegressionComparison[,8])
summary(subsetsFull)$which
## (Intercept) USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## 3 TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE
## 4 TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE
## 5 TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## 6 TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
As can be seen, the three-predictor model given by regsubsets() uses USGG3M, USGG2YR and USGG10YR; this is in fact very close to what we chose with add1().
Final selection: I decided to choose USGG3M + USGG3YR + USGG10YR as our predictors, based on add1() and regsubsets().
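A minimal, self-contained sketch of the chosen three-predictor model, using synthetic yield-like random walks as stand-ins (the real fit uses AssignmentDataRegressionComparison, which is not reproduced here):

```r
# Synthetic stand-ins for the three selected yields -- purely illustrative
set.seed(1)
n <- 500
USGG3M   <- 5 + cumsum(rnorm(n, sd = 0.10))        # short-term rate
USGG3YR  <- USGG3M  + cumsum(rnorm(n, sd = 0.05))  # medium-term rate
USGG10YR <- USGG3YR + cumsum(rnorm(n, sd = 0.05))  # long-term rate
Output1  <- 0.4*USGG3M + 0.5*USGG3YR + 0.3*USGG10YR + rnorm(n, sd = 0.01)
fit.final <- lm(Output1 ~ USGG3M + USGG3YR + USGG10YR)
summary(fit.final)$adj.r.squared  # close to 1 by construction
```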
Perform rolling window analysis of the yields data. Use package zoo for rolling window analysis.
Set the window width and window shift parameters for rolling window.
Window.width<-20; Window.shift<-5
Run rolling mean values using rollapply().
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
Calculate rolling mean values for each variable.
# Means
all.means<-rollapply(AssignmentDataRegressionComparison,width=Window.width,by=Window.shift,by.column=TRUE, mean)
head(all.means,10)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR Output1
## [1,] 15.0405 14.0855 13.2795 12.9360 12.7825 12.5780 12.1515 20.14842
## [2,] 15.1865 14.1440 13.4855 13.1085 12.9310 12.7370 12.3370 20.55208
## [3,] 15.2480 14.2755 13.7395 13.3390 13.1500 12.9480 12.5500 21.04895
## [4,] 14.9345 14.0780 13.7750 13.4765 13.2385 13.0515 12.6610 21.02611
## [5,] 14.7545 14.0585 13.9625 13.6890 13.4600 13.2295 12.8335 21.31356
## [6,] 14.6025 14.0115 14.0380 13.7790 13.5705 13.3050 12.8890 21.39061
## [7,] 14.0820 13.5195 13.8685 13.6710 13.4815 13.1880 12.7660 20.77200
## [8,] 13.6255 13.0635 13.6790 13.5735 13.4270 13.1260 12.6950 20.23626
## [9,] 13.2490 12.6810 13.5080 13.4575 13.3680 13.0770 12.6470 19.76988
## [10,] 12.9545 12.4225 13.4140 13.4240 13.3850 13.1115 12.6755 19.53054
# The first mean covers rows 1-20, the next rows 6-25, and so on, so there are floor((8300-20)/5)+1 = 1657 means; that's why all.means has 1657 rows.
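The window count in the comment above follows directly from the width and shift parameters:

```r
# Number of rolling windows for n observations, width w and shift s:
# floor((n - w)/s) + 1
n <- 8300; w <- 20; s <- 5
floor((n - w)/s) + 1  # 1657, matching nrow(all.means)
```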
# Create points at which rolling means are calculated
Count<-1:length(AssignmentDataRegressionComparison[,1]) #Count is a vector of 1:8300
Rolling.window.matrix<-rollapply(Count,width=Window.width,by=Window.shift,by.column=FALSE,
FUN=function(z) z)
Rolling.window.matrix[1:10,] # how the first 10 means are calculated.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 1 2 3 4 5 6 7 8 9 10 11 12 13
## [2,] 6 7 8 9 10 11 12 13 14 15 16 17 18
## [3,] 11 12 13 14 15 16 17 18 19 20 21 22 23
## [4,] 16 17 18 19 20 21 22 23 24 25 26 27 28
## [5,] 21 22 23 24 25 26 27 28 29 30 31 32 33
## [6,] 26 27 28 29 30 31 32 33 34 35 36 37 38
## [7,] 31 32 33 34 35 36 37 38 39 40 41 42 43
## [8,] 36 37 38 39 40 41 42 43 44 45 46 47 48
## [9,] 41 42 43 44 45 46 47 48 49 50 51 52 53
## [10,] 46 47 48 49 50 51 52 53 54 55 56 57 58
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 14 15 16 17 18 19 20
## [2,] 19 20 21 22 23 24 25
## [3,] 24 25 26 27 28 29 30
## [4,] 29 30 31 32 33 34 35
## [5,] 34 35 36 37 38 39 40
## [6,] 39 40 41 42 43 44 45
## [7,] 44 45 46 47 48 49 50
## [8,] 49 50 51 52 53 54 55
## [9,] 54 55 56 57 58 59 60
## [10,] 59 60 61 62 63 64 65
# Take middle of each window
Points.of.calculation<-Rolling.window.matrix[,10]
Points.of.calculation[1:10]
## [1] 10 15 20 25 30 35 40 45 50 55
length(Points.of.calculation)
## [1] 1657
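Equivalently, the calculation points form an arithmetic sequence, so the window matrix is not strictly needed to obtain them:

```r
# The 10th element (middle) of window k sits at position 10 + (k-1)*shift
Window.shift <- 5
pts <- seq(from = 10, by = Window.shift, length.out = 1657)
head(pts)     # 10 15 20 25 30 35
tail(pts, 1)  # 8290, the middle of the last window (rows 8281-8300)
```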
# Insert means into a full-length vector to plot the rolling mean against the original data
Means.forPlot<-rep(NA,length(AssignmentDataRegressionComparison[,1])) #define a 8300 length NA
Means.forPlot[Points.of.calculation]<-all.means[,1] # positions 10, 15, 20, ..., 8290 of Means.forPlot receive the rolling means of the 3M yield; all other positions remain NA
Means.forPlot[1:50]
## [1] NA NA NA NA NA NA NA NA
## [9] NA 15.0405 NA NA NA NA 15.1865 NA
## [17] NA NA NA 15.2480 NA NA NA NA
## [25] 14.9345 NA NA NA NA 14.7545 NA NA
## [33] NA NA 14.6025 NA NA NA NA 14.0820
## [41] NA NA NA NA 13.6255 NA NA NA
## [49] NA 13.2490
cbind(AssignmentDataRegressionComparison[,1],Means.forPlot)[1:50,]
## Means.forPlot
## [1,] 13.52 NA
## [2,] 13.58 NA
## [3,] 14.50 NA
## [4,] 14.76 NA
## [5,] 15.20 NA
## [6,] 15.22 NA
## [7,] 15.24 NA
## [8,] 15.08 NA
## [9,] 15.25 NA
## [10,] 15.15 15.0405
## [11,] 15.79 NA
## [12,] 15.32 NA
## [13,] 15.71 NA
## [14,] 15.72 NA
## [15,] 15.70 15.1865
## [16,] 15.35 NA
## [17,] 15.20 NA
## [18,] 15.06 NA
## [19,] 14.87 NA
## [20,] 14.59 15.2480
## [21,] 14.90 NA
## [22,] 14.85 NA
## [23,] 14.67 NA
## [24,] 14.74 NA
## [25,] 15.32 14.9345
## [26,] 15.52 NA
## [27,] 15.46 NA
## [28,] 15.54 NA
## [29,] 15.51 NA
## [30,] 15.14 14.7545
## [31,] 15.02 NA
## [32,] 14.48 NA
## [33,] 14.09 NA
## [34,] 14.23 NA
## [35,] 14.15 14.6025
## [36,] 14.20 NA
## [37,] 14.14 NA
## [38,] 14.22 NA
## [39,] 14.52 NA
## [40,] 14.39 14.0820
## [41,] 14.49 NA
## [42,] 14.51 NA
## [43,] 14.29 NA
## [44,] 14.16 NA
## [45,] 13.99 13.6255
## [46,] 13.92 NA
## [47,] 13.66 NA
## [48,] 13.21 NA
## [49,] 13.02 NA
## [50,] 12.95 13.2490
plot(Means.forPlot,col="red")
lines(AssignmentDataRegressionComparison[,1])
Compute the rolling standard deviation of the daily differences of each variable.
all.diff<- data.frame(diff(as.matrix(AssignmentDataRegressionComparison)))
head(all.diff,10)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## 1/6/1981 0.06 0.07 0.14 0.03 -0.08 -0.04 0.00
## 1/7/1981 0.92 0.74 0.50 0.47 0.40 0.27 0.22
## 1/8/1981 0.26 0.10 0.17 0.17 0.07 -0.03 0.02
## 1/9/1981 0.44 0.30 0.44 0.33 0.20 0.22 0.22
## 1/12/1981 0.02 -0.07 -0.36 -0.34 -0.17 -0.12 -0.05
## 1/13/1981 0.02 -0.13 0.13 0.03 -0.03 0.08 0.00
## 1/14/1981 -0.16 -0.20 -0.35 -0.22 -0.07 0.00 -0.01
## 1/15/1981 0.17 0.19 0.30 0.27 0.16 0.09 0.18
## 1/16/1981 -0.10 -0.11 -0.17 -0.17 -0.11 -0.09 -0.12
## 1/19/1981 0.64 0.75 0.54 0.07 0.26 0.11 0.12
## Output1
## 1/6/1981 0.07587222
## 1/7/1981 1.35591615
## 1/8/1981 0.30119608
## 1/9/1981 0.82353207
## 1/12/1981 -0.42985741
## 1/13/1981 0.03935812
## 1/14/1981 -0.40425521
## 1/15/1981 0.52159590
## 1/16/1981 -0.33130842
## 1/19/1981 0.96621424
rolling.sd = rollapply(all.diff,width=Window.width,by=Window.shift,by.column=TRUE, sd)
head(rolling.sd)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## [1,] 0.3433258 0.3262462 0.2748258 0.2030161 0.1713192 0.1299585 0.1202147
## [2,] 0.2933383 0.2907504 0.2261811 0.1499219 0.1450082 0.1146895 0.1192201
## [3,] 0.2613180 0.2437530 0.2006201 0.1632596 0.1654110 0.1459308 0.1351909
## [4,] 0.2551754 0.2469663 0.1989446 0.1692794 0.1717219 0.1551052 0.1422183
## [5,] 0.2480551 0.2481595 0.2102004 0.1786057 0.1744767 0.1643960 0.1516540
## [6,] 0.1963884 0.2363672 0.2095082 0.1809180 0.1822917 0.1664956 0.1537351
## Output1
## [1,] 0.5639875
## [2,] 0.4707427
## [3,] 0.4681168
## [4,] 0.4786189
## [5,] 0.4888569
## [6,] 0.4788897
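To confirm what rollapply() computes here, the first rolling value can be checked against a direct sd() call on the first window. A small synthetic check (zoo is already loaded above):

```r
library(zoo)  # for rollapply()
set.seed(2)
x  <- rnorm(100)                                  # synthetic daily differences
rs <- rollapply(x, width = 20, by = 5, FUN = sd)  # same window parameters
all.equal(rs[1], sd(x[1:20]))                     # TRUE: first window is rows 1-20
```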
rolling.dates<-rollapply(AssignmentDataRegressionComparison[-1,],width=Window.width,by=Window.shift,
by.column=FALSE,FUN=function(z) rownames(z))
head(rolling.dates)
## [,1] [,2] [,3] [,4] [,5]
## [1,] "1/6/1981" "1/7/1981" "1/8/1981" "1/9/1981" "1/12/1981"
## [2,] "1/13/1981" "1/14/1981" "1/15/1981" "1/16/1981" "1/19/1981"
## [3,] "1/20/1981" "1/21/1981" "1/22/1981" "1/23/1981" "1/26/1981"
## [4,] "1/27/1981" "1/28/1981" "1/29/1981" "1/30/1981" "2/2/1981"
## [5,] "2/3/1981" "2/4/1981" "2/5/1981" "2/6/1981" "2/9/1981"
## [6,] "2/10/1981" "2/11/1981" "2/13/1981" "2/17/1981" "2/18/1981"
## [,6] [,7] [,8] [,9] [,10]
## [1,] "1/13/1981" "1/14/1981" "1/15/1981" "1/16/1981" "1/19/1981"
## [2,] "1/20/1981" "1/21/1981" "1/22/1981" "1/23/1981" "1/26/1981"
## [3,] "1/27/1981" "1/28/1981" "1/29/1981" "1/30/1981" "2/2/1981"
## [4,] "2/3/1981" "2/4/1981" "2/5/1981" "2/6/1981" "2/9/1981"
## [5,] "2/10/1981" "2/11/1981" "2/13/1981" "2/17/1981" "2/18/1981"
## [6,] "2/19/1981" "2/20/1981" "2/23/1981" "2/24/1981" "2/25/1981"
## [,11] [,12] [,13] [,14] [,15]
## [1,] "1/20/1981" "1/21/1981" "1/22/1981" "1/23/1981" "1/26/1981"
## [2,] "1/27/1981" "1/28/1981" "1/29/1981" "1/30/1981" "2/2/1981"
## [3,] "2/3/1981" "2/4/1981" "2/5/1981" "2/6/1981" "2/9/1981"
## [4,] "2/10/1981" "2/11/1981" "2/13/1981" "2/17/1981" "2/18/1981"
## [5,] "2/19/1981" "2/20/1981" "2/23/1981" "2/24/1981" "2/25/1981"
## [6,] "2/26/1981" "2/27/1981" "3/2/1981" "3/3/1981" "3/4/1981"
## [,16] [,17] [,18] [,19] [,20]
## [1,] "1/27/1981" "1/28/1981" "1/29/1981" "1/30/1981" "2/2/1981"
## [2,] "2/3/1981" "2/4/1981" "2/5/1981" "2/6/1981" "2/9/1981"
## [3,] "2/10/1981" "2/11/1981" "2/13/1981" "2/17/1981" "2/18/1981"
## [4,] "2/19/1981" "2/20/1981" "2/23/1981" "2/24/1981" "2/25/1981"
## [5,] "2/26/1981" "2/27/1981" "3/2/1981" "3/3/1981" "3/4/1981"
## [6,] "3/5/1981" "3/6/1981" "3/9/1981" "3/10/1981" "3/11/1981"
rownames(rolling.sd)<-rolling.dates[,10]
head(rolling.sd) # a matrix of rolling sds of the daily differences, with the middle date of each window as row name
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR
## 1/19/1981 0.3433258 0.3262462 0.2748258 0.2030161 0.1713192 0.1299585
## 1/26/1981 0.2933383 0.2907504 0.2261811 0.1499219 0.1450082 0.1146895
## 2/2/1981 0.2613180 0.2437530 0.2006201 0.1632596 0.1654110 0.1459308
## 2/9/1981 0.2551754 0.2469663 0.1989446 0.1692794 0.1717219 0.1551052
## 2/18/1981 0.2480551 0.2481595 0.2102004 0.1786057 0.1744767 0.1643960
## 2/25/1981 0.1963884 0.2363672 0.2095082 0.1809180 0.1822917 0.1664956
## USGG30YR Output1
## 1/19/1981 0.1202147 0.5639875
## 1/26/1981 0.1192201 0.4707427
## 2/2/1981 0.1351909 0.4681168
## 2/9/1981 0.1422183 0.4786189
## 2/18/1981 0.1516540 0.4888569
## 2/25/1981 0.1537351 0.4788897
matplot(rolling.sd[,c(1,5,7,8)],xaxt="n",type="l",col=c("black","red","blue","green")) # 3M, 5Y and 30Y and output
axis(side=1,at=1:1656,rownames(rolling.sd))
Show periods of high volatility.
high.volatility.periods<-rownames(rolling.sd)[rolling.sd[,8]>.5]
high.volatility.periods
## [1] "1/19/1981" "3/25/1981" "4/1/1981" "4/8/1981" "4/23/1981"
## [6] "4/30/1981" "5/7/1981" "5/14/1981" "5/21/1981" "5/29/1981"
## [11] "6/5/1981" "6/12/1981" "6/19/1981" "6/26/1981" "7/6/1981"
## [16] "7/13/1981" "7/20/1981" "7/27/1981" "10/28/1981" "11/5/1981"
## [21] "11/13/1981" "11/20/1981" "11/30/1981" "12/7/1981" "12/14/1981"
## [26] "12/29/1981" "1/14/1982" "1/21/1982" "1/28/1982" "2/4/1982"
## [31] "2/11/1982" "7/29/1982" "8/5/1982" "8/12/1982" "8/19/1982"
## [36] "8/26/1982" "9/24/1982" "10/1/1982" "10/8/1982" "10/18/1982"
## [41] "10/13/1987" "10/20/1987" "10/27/1987" "11/19/2007" "11/26/2007"
## [46] "11/12/2008" "11/19/2008"
How is volatility related to the level of rates?
As can be seen in the chart, high volatility is associated most strongly with fluctuations of the short-term USGG3M rate (black), then USGG5YR (red), and least with USGG30YR (blue); the high-volatility periods also coincide with the early 1980s, when the level of rates was highest.
Fit linear model to rolling window data using 3 months, 5 years and 30 years variables as predictors.
# Rolling lm coefficients
Coefficients<-rollapply(AssignmentDataRegressionComparison,width=Window.width,by=Window.shift,by.column=FALSE, FUN=function(z) coef(lm(Output1~USGG3M+USGG5YR+USGG30YR,data=as.data.frame(z))))
rolling.dates<-rollapply(AssignmentDataRegressionComparison[,1:8],width=Window.width,by=Window.shift,by.column=FALSE, FUN=function(z) rownames(z)) # recomputed on all rows (no diff), since the coefficients are fit on levels rather than daily differences
rownames(Coefficients)<-rolling.dates[,10]
Coefficients[1:10,]
## (Intercept) USGG3M USGG5YR USGG30YR
## 1/16/1981 -15.70877 0.6723609 1.797680 0.2276011
## 1/23/1981 -15.96684 0.6948992 1.480514 0.5529139
## 1/30/1981 -16.77273 0.7078197 1.434388 0.6507280
## 2/6/1981 -16.90734 0.7279033 1.470083 0.6003377
## 2/17/1981 -17.46896 0.7343406 1.361289 0.7499705
## 2/24/1981 -17.04722 0.7357663 1.295641 0.7844908
## 3/3/1981 -17.67901 0.8544681 1.396653 0.5945022
## 3/10/1981 -17.72402 0.9162385 1.654274 0.2571200
## 3/17/1981 -17.00260 0.9265767 1.647852 0.1951273
## 3/24/1981 -16.84302 0.9102780 1.477727 0.3788401
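The mechanics of this rolling regression can be illustrated on synthetic data (the names x1, x2, y are placeholders, not the assignment columns):

```r
library(zoo)  # for rollapply()
set.seed(3)
d   <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 2*d$x1 - d$x2 + rnorm(100, sd = 0.1)   # known coefficients 1, 2, -1
co  <- rollapply(d, width = 20, by = 5, by.column = FALSE,
                 FUN = function(z) coef(lm(y ~ x1 + x2, data = as.data.frame(z))))
dim(co)  # 17 windows x 3 coefficients (intercept, x1, x2)
```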
Look at pairwise X-Y plots of regression coefficients for the 3M, 5Yr and 30Yr yields as inputs.
# Pairs plot of Coefficients
pairs(Coefficients)
Interpret the pairs plot.
The coefficients of USGG5YR and USGG30YR have a strong negative correlation: with everything else held constant, USGG5YR tends to move the Output in the direction opposite to USGG30YR.
The coefficient of USGG3M is not strongly correlated with either the USGG5YR or the USGG30YR coefficient, so the 3-month rate affects the Output in its own way.
When the intercept is positive, the USGG5YR coefficient tends to be positive as well, whereas the USGG30YR coefficient is negative, and vice versa.
Plot the coefficients. Show periods.
# Plot of coefficients
matplot(Coefficients[,-1],xaxt="n",type="l",col=c("black","red","green"))
axis(side=1,at=1:1657,rownames(Coefficients))
high.slopespread.periods<-rownames(Coefficients)[Coefficients[,3]-Coefficients[,4]>3] # USGG5YR and USGG30YR coefficient difference larger than 3
jump.slopes<-rownames(Coefficients)[Coefficients[,3]>3] #USGG5YR coefficient larger than 3 itself
high.slopespread.periods
## [1] "4/26/1982" "12/15/1982" "9/16/1985" "5/12/1987" "5/19/1987"
## [6] "5/27/1987" "9/25/1987" "3/15/1988" "9/27/1988" "10/5/1988"
## [11] "3/10/1989" "2/5/1992" "8/3/1994" "12/8/1994" "6/14/1996"
## [16] "5/9/1997" "5/16/1997" "5/23/1997" "5/30/1997" "12/26/2000"
## [21] "1/2/2001" "7/25/2001" "8/1/2001" "11/13/2003" "8/12/2004"
## [26] "12/16/2004"
jump.slopes
## [1] "12/16/2004"
Is the picture of coefficients consistent with the picture of pairs? If yes, explain why.
Yes: the green and red curves move in almost opposite directions, as seen in the pairs() plot, whereas the black curve appears unrelated to either.
How often is the R-squared not considered high?
# R-squared
r.squared<-rollapply(AssignmentDataRegressionComparison,width=Window.width,by=Window.shift,by.column=FALSE,
FUN=function(z) summary(lm(Output1~USGG3M+USGG5YR+USGG30YR,data=as.data.frame(z)))$r.squared)
r.squared<-cbind(rolling.dates[,10],r.squared)
r.squared[1:10,]
## r.squared
## [1,] "1/16/1981" "0.995046300986446"
## [2,] "1/23/1981" "0.992485868136646"
## [3,] "1/30/1981" "0.998641209587999"
## [4,] "2/6/1981" "0.998849080081881"
## [5,] "2/17/1981" "0.997958757207598"
## [6,] "2/24/1981" "0.996489757136839"
## [7,] "3/3/1981" "0.99779753570421"
## [8,] "3/10/1981" "0.998963395226792"
## [9,] "3/17/1981" "0.998729445388789"
## [10,] "3/24/1981" "0.997073000898673"
plot(r.squared[,2],xaxt="n",ylim=c(0,1))
axis(side=1,at=1:1657,rownames(Coefficients))
(low.r.squared.periods<-r.squared[r.squared[,2]<.9,1])
## [1] "6/24/1987" "6/27/1991" "4/28/2005" "6/20/2012"
What could cause decrease of \(R^2\)?
One reason could be that during those windows the predictors we chose do not explain the Output well; we could instead try using 6M, 2YR or 10YR as predictors.
Alternatively, market events during those windows may have temporarily weakened the relationship between the yields and the Output.
Analyze the rolling p-values.
# P-values
Pvalues<-rollapply(AssignmentDataRegressionComparison,width=Window.width,by=Window.shift,by.column=FALSE,
FUN=function(z) summary(lm(Output1~USGG3M+USGG5YR+USGG30YR,data=as.data.frame(z)))$coefficients[,4])
rownames(Pvalues)<-rolling.dates[,10]
Pvalues[1:10,]
## (Intercept) USGG3M USGG5YR USGG30YR
## 1/16/1981 1.193499e-10 3.764585e-10 2.391260e-07 2.538852e-01
## 1/23/1981 3.751077e-12 1.008053e-11 2.447369e-07 5.300949e-03
## 1/30/1981 3.106359e-18 1.406387e-14 4.040035e-09 3.626961e-05
## 2/6/1981 2.591522e-19 3.360104e-19 3.828054e-11 2.221691e-05
## 2/17/1981 1.897239e-16 6.578118e-17 1.461743e-09 1.331767e-04
## 2/24/1981 2.341158e-13 1.000212e-13 9.008221e-07 4.733543e-03
## 3/3/1981 5.435581e-14 1.535503e-11 3.357199e-06 6.010473e-02
## 3/10/1981 6.227624e-16 1.178498e-16 1.679479e-05 3.851840e-01
## 3/17/1981 9.592582e-17 7.065226e-20 1.459692e-05 5.025726e-01
## 3/24/1981 8.248747e-16 6.689840e-16 6.413371e-04 3.052705e-01
matplot(Pvalues,xaxt="n",col=c("black","blue","red","green"),type="o")
axis(side=1,at=1:1657,rownames(Coefficients))
rownames(Pvalues)[Pvalues[,2]>.5] # days on which USGG3M is not a significant predictor
## [1] "7/15/1992" "7/26/1996" "8/2/1996" "11/7/2000" "5/30/2001"
## [6] "5/2/2002" "5/16/2002" "5/23/2002" "1/30/2003" "2/6/2003"
## [11] "7/24/2003" "7/31/2003" "8/7/2003" "11/20/2003" "12/18/2003"
## [16] "4/28/2005" "2/10/2006" "3/9/2007" "3/16/2007" "7/21/2009"
## [21] "10/6/2009" "10/13/2009" "12/28/2010" "1/11/2011" "3/1/2011"
## [26] "11/16/2011" "11/23/2011" "5/23/2012" "7/11/2012" "6/6/2013"
## [31] "1/16/2014" "1/30/2014" "3/6/2014"
rownames(Pvalues)[Pvalues[,3]>.5] # days on which USGG5YR is not a significant predictor
## [1] "12/1/1982" "3/16/1987" "4/28/1987" "6/24/1987" "9/3/1987" "9/11/1987"
## [7] "9/20/1988" "12/3/1999"
rownames(Pvalues)[Pvalues[,4]>.5] # days on which USGG30YR is not a significant predictor
## [1] "3/17/1981" "4/22/1981" "4/29/1981" "6/4/1981" "10/13/1981"
## [6] "11/19/1981" "2/3/1982" "2/26/1982" "4/2/1982" "4/12/1982"
## [11] "5/3/1982" "7/7/1982" "9/1/1982" "9/23/1982" "12/8/1982"
## [16] "2/18/1983" "3/1/1983" "5/4/1983" "12/30/1983" "1/9/1984"
## [21] "2/6/1984" "3/14/1984" "3/21/1984" "4/11/1984" "6/22/1984"
## [26] "6/29/1984" "10/31/1984" "11/16/1984" "11/26/1984" "12/17/1984"
## [31] "3/25/1985" "5/14/1985" "5/21/1985" "8/30/1985" "9/9/1985"
## [36] "9/23/1985" "10/1/1985" "10/8/1985" "10/16/1985" "10/23/1985"
## [41] "10/30/1985" "11/14/1985" "11/21/1985" "1/22/1986" "1/29/1986"
## [46] "5/5/1987" "12/16/1987" "1/25/1988" "2/1/1988" "2/16/1988"
## [51] "3/1/1988" "3/22/1988" "5/18/1988" "6/3/1988" "6/10/1988"
## [56] "6/24/1988" "7/25/1988" "8/15/1988" "12/5/1988" "2/2/1989"
## [61] "3/3/1989" "4/10/1989" "5/1/1989" "6/13/1989" "8/16/1989"
## [66] "9/14/1989" "9/21/1989" "10/3/1989" "10/11/1989" "10/18/1989"
## [71] "11/1/1989" "11/30/1989" "12/7/1989" "1/8/1990" "1/16/1990"
## [76] "6/15/1990" "7/30/1990" "8/6/1990" "10/2/1990" "10/10/1990"
## [81] "11/23/1990" "3/1/1991" "5/13/1991" "5/20/1991" "6/13/1991"
## [86] "7/5/1991" "7/19/1991" "9/16/1991" "2/12/1992" "2/20/1992"
## [91] "3/12/1992" "4/16/1992" "4/24/1992" "5/1/1992" "5/8/1992"
## [96] "6/2/1992" "6/9/1992" "6/16/1992" "8/19/1992" "8/26/1992"
## [101] "10/23/1992" "5/20/1993" "6/11/1993" "6/18/1993" "8/30/1993"
## [106] "12/15/1993" "12/22/1993" "3/16/1994" "3/30/1994" "4/6/1994"
## [111] "4/20/1994" "4/27/1994" "6/29/1994" "8/17/1994" "9/21/1994"
## [116] "9/28/1994" "12/22/1994" "12/29/1994" "1/5/1995" "1/12/1995"
## [121] "1/26/1995" "2/2/1995" "2/9/1995" "4/6/1995" "4/13/1995"
## [126] "8/25/1995" "9/29/1995" "10/27/1995" "11/3/1995" "11/10/1995"
## [131] "12/29/1995" "1/5/1996" "1/12/1996" "1/19/1996" "3/29/1996"
## [136] "5/31/1996" "6/21/1996" "7/12/1996" "7/19/1996" "7/26/1996"
## [141] "8/2/1996" "8/9/1996" "8/16/1996" "8/23/1996" "9/27/1996"
## [146] "10/4/1996" "12/6/1996" "2/28/1997" "3/7/1997" "4/18/1997"
## [151] "4/25/1997" "5/2/1997" "6/13/1997" "6/20/1997" "6/27/1997"
## [156] "7/4/1997" "10/10/1997" "10/17/1997" "12/12/1997" "12/19/1997"
## [161] "12/26/1997" "1/9/1998" "1/16/1998" "8/14/1998" "8/21/1998"
## [166] "8/28/1998" "9/18/1998" "9/25/1998" "12/4/1998" "12/11/1998"
## [171] "1/8/1999" "3/12/1999" "4/2/1999" "5/14/1999" "6/25/1999"
## [176] "7/9/1999" "7/30/1999" "8/20/1999" "9/10/1999" "9/24/1999"
## [181] "10/15/1999" "12/31/1999" "3/3/2000" "3/31/2000" "4/7/2000"
## [186] "4/14/2000" "4/21/2000" "7/25/2000" "11/21/2000" "11/28/2000"
## [191] "3/7/2001" "5/30/2001" "7/11/2001" "10/11/2001" "12/13/2001"
## [196] "1/24/2002" "1/31/2002" "8/29/2002" "9/26/2002" "12/26/2002"
## [201] "1/16/2003" "1/23/2003" "3/6/2003" "3/13/2003" "3/20/2003"
## [206] "3/27/2003" "5/1/2003" "6/19/2003" "6/26/2003" "7/3/2003"
## [211] "9/18/2003" "9/25/2003" "10/16/2003" "10/23/2003" "10/30/2003"
## [216] "11/20/2003" "1/1/2004" "2/5/2004" "2/26/2004" "3/4/2004"
## [221] "4/15/2004" "4/22/2004" "5/13/2004" "5/27/2004" "6/3/2004"
## [226] "6/17/2004" "6/24/2004" "7/8/2004" "10/14/2004" "10/21/2004"
## [231] "10/28/2004" "11/18/2004" "12/23/2004" "3/17/2005" "3/24/2005"
## [236] "3/31/2005" "4/7/2005" "7/21/2005" "8/11/2005" "9/22/2005"
## [241] "10/6/2005" "11/3/2005" "12/8/2005" "12/15/2005" "1/20/2006"
## [246] "5/12/2006" "5/19/2006" "5/26/2006" "6/2/2006" "6/9/2006"
## [251] "6/16/2006" "7/7/2006" "7/21/2006" "10/6/2006" "11/3/2006"
## [256] "1/12/2007" "2/23/2007" "3/16/2007" "4/27/2007" "5/25/2007"
## [261] "9/7/2007" "9/14/2007" "9/21/2007" "9/28/2007" "11/16/2007"
## [266] "5/5/2009" "6/1/2010" "6/15/2010" "1/25/2011" "2/1/2011"
## [271] "6/12/2014"
Interpret the plot.
The intercept is always significant; the previous level of the Output has a significant impact on the next one.
During 1987-88 USGG5YR has some insignificant dates, but overall it is the best predictor among the three.
During 2003 and 2009-14 USGG3M fails significance more often, but prior to the 1990s it was the best predictor of the Output.
USGG30YR has many insignificant dates until about 2010, so although historically it was not the best choice of predictor, it has become more important since then.
Perform PCA with the inputs (columns 1-7).
AssignmentData.Output<-AssignmentData$Output1
AssignmentData<-data.matrix(AssignmentData[,1:7],rownames.force="automatic")
dim(AssignmentData)
## [1] 8300 7
head(AssignmentData)
## USGG3M USGG6M USGG2YR USGG3YR USGG5YR USGG10YR USGG30YR
## 1/5/1981 13.52 13.09 12.289 12.28 12.294 12.152 11.672
## 1/6/1981 13.58 13.16 12.429 12.31 12.214 12.112 11.672
## 1/7/1981 14.50 13.90 12.929 12.78 12.614 12.382 11.892
## 1/8/1981 14.76 14.00 13.099 12.95 12.684 12.352 11.912
## 1/9/1981 15.20 14.30 13.539 13.28 12.884 12.572 12.132
## 1/12/1981 15.22 14.23 13.179 12.94 12.714 12.452 12.082
AD.Predictors.PCA<-princomp(AssignmentData)
Plot factors and their explained relative variance
plot(AD.Predictors.PCA)
barplot(AD.Predictors.PCA$sdev^2/sum(AD.Predictors.PCA$sdev^2),ylim=c(0,1))
The cumulative proportions of explained variance are:
cumsum(AD.Predictors.PCA$sdev^2/sum(AD.Predictors.PCA$sdev^2))
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 0.9783429 0.9981063 0.9996652 0.9998455 0.9999515 0.9999802 1.0000000
Comp.1 alone explains about 97.8% of the total variation among the predictors.
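The proportions above are just the squared component sdev values normalized to sum to 1. A small synthetic sketch of the same calculation:

```r
# princomp on correlated synthetic columns: the first component dominates
set.seed(4)
m <- matrix(rnorm(300), ncol = 3)
m[, 2] <- m[, 1] + rnorm(100, sd = 0.1)  # make two columns nearly collinear
p <- princomp(m)
prop <- p$sdev^2 / sum(p$sdev^2)         # proportion of variance per component
cumsum(prop)                             # reaches 1 at the last component
```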
Visualize the loadings
(AD.PCALoadings<-AD.Predictors.PCA$loadings)
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## USGG3M -0.384 0.507 0.530 -0.404 0.386
## USGG6M -0.390 0.439 0.111 0.405 -0.679
## USGG2YR -0.415 0.111 -0.419 0.409 0.379 -0.298 0.490
## USGG3YR -0.406 -0.448 0.236 0.198 -0.732
## USGG5YR -0.386 -0.231 -0.246 -0.534 -0.287 0.421 0.439
## USGG10YR -0.348 -0.432 0.150 -0.199 -0.256 -0.736 -0.153
## USGG30YR -0.305 -0.544 0.498 0.421 0.207 0.378
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.143 0.143 0.143 0.143 0.143 0.143 0.143
## Cumulative Var 0.143 0.286 0.429 0.571 0.714 0.857 1.000
matplot(1:7,AD.Predictors.PCA$loadings,type="l",lty=1:7,lwd=1,xaxt="n",xlab="Predictor",ylab="Factor Loadings",ylim=c(-0.8,0.8),col=c("black","red","blue","green","cyan","orange","magenta"))
abline(h=0)
axis(1, 1:7,labels=colnames(AssignmentData))
legend("topright",legend=c("L1","L2","L3","L4","L5","L6","L7"),lty=1:7,lwd=1,cex=.7,col=c("black","red","blue","green","cyan","orange","magenta"))
Comp.1 has negative loadings from all predictors, meaning it captures a joint increase or decrease of all yields together (the level of the curve). Comp.2 captures opposite movements of the short-term and long-term yields (the slope), whereas Comp.3 captures opposite movements of the medium-term yields versus the short- and long-term ones (the curvature).
Visualize Factors
AD.PCAFactors<-AD.Predictors.PCA$scores
#Plot
matplot(AD.Predictors.PCA$scores,type="l",lty=1:7,lwd=1,ylim=c(-30,20),col=c("black","red","blue","green","cyan","orange","magenta"))
legend("topleft",ncol=2, legend=c("F1","F2","F3","F4","F5","F6","F7"),lty=1:7,lwd=1,col=c("black","red","blue","green","cyan","orange","magenta"))
Visualize correlations
AD.Rotated<-as.data.frame(cbind(Output1=AssignmentData.Output,AD.PCAFactors))
pairs(as.matrix(AD.Rotated))
Output1 is strongly negatively correlated with Comp.1 and shows essentially no correlation with the other components (perhaps a very slight positive association with Comp.5).
Note that principal component scores are mutually uncorrelated by construction, so any apparent pattern between components in the pairs plot does not reflect linear correlation.
(rSqrCorrelations<-apply(AD.Predictors.PCA$scores,2,cor,AssignmentData.Output)^2)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## 1.000000e+00 1.557347e-24 3.341241e-24 1.293688e-26 5.060602e-24
## Comp.6 Comp.7
## 1.280434e-24 1.941283e-24
Calculate relative importance measures for the PCA factors.
AD.lm.PCA.full<-lm(Output1~., data=AD.Rotated)
metrics.AD.pca <- calc.relimp(AD.lm.PCA.full, type = c("lmg", "first", "last","betasq", "pratt"))
## Warning in rev(variances[[p]]) - variances[[p + 1]]: Recycling array of length 1 in vector-array arithmetic is deprecated.
## Use c() or as.vector() instead.
(metrics.AD.pca@lmg.rank)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 1.0 4.5 4.5 4.5 4.5 4.5 4.5
subsets.AD.Rotated<-regsubsets(x=AD.PCAFactors,y=AD.Rotated$Output1)
summary(subsets.AD.Rotated)$which
## (Intercept) Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
## 1 TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
## 2 TRUE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## 3 TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE
## 4 TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
## 5 TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE
## 6 TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Indeed, as we suspected from the correlation charts, Comp.1 is essential for predicting Output1, and the next most useful predictor is Comp.5.
AD.lm.PCA.1<-lm(Output1 ~ Comp.1, data=AD.Rotated)
summary(AD.lm.PCA.1)
##
## Call:
## lm(formula = Output1 ~ Comp.1, data = AD.Rotated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.011e-09 -3.478e-10 -6.800e-12 3.296e-10 5.029e-09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.421e-11 1.715e-11 8.290e-01 0.407
## Comp.1 -1.000e+00 1.957e-12 -5.111e+11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.562e-09 on 8298 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.612e+23 on 1 and 8298 DF, p-value: < 2.2e-16
AD.lm.PCA.15<-lm(Output1 ~ Comp.1 + Comp.5, data=AD.Rotated)
summary(AD.lm.PCA.15)
##
## Call:
## lm(formula = Output1 ~ Comp.1 + Comp.5, data = AD.Rotated)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.033e-09 -3.468e-10 -5.000e-12 3.337e-10 5.036e-09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.421e-11 1.715e-11 8.290e-01 0.407
## Comp.1 -1.000e+00 1.957e-12 -5.111e+11 <2e-16 ***
## Comp.5 2.196e-10 1.880e-10 1.168e+00 0.243
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.562e-09 on 8297 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.306e+23 on 2 and 8297 DF, p-value: < 2.2e-16
After trying the model with and without Comp.5, we can see that Comp.1 alone is enough: adjusted R-squared equals 1 and the residuals are on the order of 1e-9. Rather than overfitting, this suggests Output1 is simply the negative of the first principal component factor. Either way, it is a major simplification compared with selecting 1, 2 or 3 yield predictors as we did in Step 5.
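The slope of exactly -1 with near-zero residuals is what one would see if the response were minus the first PC score itself; a synthetic sketch of that situation:

```r
# If the response equals minus the first principal component score, the
# regression recovers intercept 0 and slope -1 (up to machine precision)
set.seed(5)
m <- matrix(rnorm(200), ncol = 2)
p <- princomp(m)
y <- -p$scores[, 1]
coef(lm(y ~ p$scores[, 1]))  # ~0 and -1
```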